# Replication for The Define Combine Procedure

## Setup R environment/packages

-   Tested on R 4.3.1 under macOS Ventura 13.1 on an Apple M1 processor.
-   Simulations were run on the [Boston University Shared Computing Cluster](https://www.bu.edu/tech/support/research/computing-resources/scc/) running R 4.3.1 on Linux and using CPU nodes. Running all of the simulations required approximately 2,500 CPU-hours, but this will vary significantly with CPU speed.
-   We use [`renv`](https://rstudio.github.io/renv/articles/renv.html) to preserve a snapshot of package versions. When you open the project (`DefineCombineProcedure.Rproj`), renv will be activated automatically. You will then have to install the necessary packages. However, we have found that renv has issues installing some packages, particularly `redist` and `redistmetrics`. To avoid these problems, follow these steps:
    1.  Open the project using `DefineCombineProcedure.Rproj`. This will activate the renv environment.
    2.  Open `code/packages.R`. Install the packages in order. You may need to install additional tools to build the `redist` package on your computer. Note that a specific version of the `redist` package is required for perfect replication due to a change in the map generation algorithm.
    3.  The final two lines of `code/packages.R` are essential for setting up one of the algorithms in the `redist` package. See [here](https://alarm-redist.org/redist/reference/redist.init.enumpart.html) for more on the enumeration algorithm.
    4.  Run `renv::status()`. This will report what additional packages need to be installed. Run `renv::install()` to install the additional packages.

## Simulations

### 1. Starting data

-   `data/redist_inputs` This folder contains map objects for the `redist` package for every state and chamber simulated. The underlying data is based on the [2020 Redistricting Data Files](https://github.com/alarm-redist/census-2020) produced by Christopher T. Kenny and Cory McCartan, which combine election data from the [Voting and Election Science Team (VEST)](https://dataverse.harvard.edu/dataverse/electionscience) with data from the 2020 U.S. Census.
-   `data/source/base_maps_to_generate.csv` This file specifies the parameters for the base maps generated in the next step.
-   `data/source/simulations_to_run.csv` This file specifies the parameters for each simulation run in step #3.
-   `data/source` also includes other files used as starting points for simulations or data used to supplement the results (e.g., 2020 presidential election results and data on redistricting methods by state).

### 2. Generate base maps

To speed up the simulations and make results reproducible *after this step*, we generated a set of random starting maps for every state and number of districts (N for unilateral redistricting and 2\*N for DCP redistricting). The `redist::redist_smc` function does not give replicable output from a random seed when multiple processors are used, but generating starting maps without multiple processors is extremely slow: maps for states with large numbers of districts, such as CA, TX, FL, and NY, take many hours, especially the DCP starting maps, which require double the number of districts. To ensure that our results are replicable, we therefore generated the starting maps once and saved them; the following stage reads these saved maps. For each state we generated `max(10*N, 200)` starting maps, where N is the number of districts. The file `data/source/base_maps_to_generate.csv` specifies the population tolerance for each base map.
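As a rough illustration of the `max(10*N, 200)` rule, the sketch below computes the number of starting maps for two district counts. The helper name `n_starting_maps` and the example district counts are hypothetical, and the sketch assumes that N means the number of districts in the map being generated (so 2\*N for DCP starting maps); check `data/source/base_maps_to_generate.csv` for the values actually used.

```r
# Number of starting maps for a plan with `n_districts` districts,
# following the rule stated above: max(10 * N, 200).
# `n_starting_maps` is a hypothetical helper, not part of the repository.
n_starting_maps <- function(n_districts) {
  stopifnot(is.numeric(n_districts), all(n_districts >= 1))
  pmax(10 * n_districts, 200)
}

# Illustrative district counts (hypothetical inputs, not read from the CSV).
n <- c(small = 4, large = 52)

# Assumption: unilateral maps use N districts and DCP maps use 2 * N.
unilateral <- n_starting_maps(n)
dcp        <- n_starting_maps(2 * n)
print(rbind(unilateral, dcp))
```

Under these assumptions the floor of 200 maps applies to small states, while states with many districts scale with `10 * N`.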

### 3. Run simulations

There are three separate sets of simulations (main, uniform swing, and national swing). The latter two are close variations of the main simulations, but the input data is modified for the uniform or national swing before the simulations are run. For each set of simulations, there is a script to run the unilateral simulations and a script to run the DCP simulations.

Each script is run from the command line and takes five arguments:

| Parameter | Detail                                                           | Example |
|-----------|------------------------------------------------------------------|---------|
| state     | two-letter, lowercase state abbreviation                         | ia      |
| dist      | "cd", "sd", or "ld"                                              | cd      |
| party     | "d" or "r" for unilateral simulations, "dr" or "rd" for DCP.     | dr      |
| task      | number identifying the simulation                                | 1       |
| seed      | a random seed, predetermined for each simulation for replication | 12345   |

For example, the following command, run in the terminal, runs a single DCP simulation for Georgia congressional districts in which the Republicans are the definer and the Democrats are the combiner (party `rd`).

```         
Rscript code/dcp_sims_define_combine.R ga cd rd 1 1234
```

To save the output to a log file:

```         
Rscript code/dcp_sims_define_combine.R ga cd rd 1 1234 > log_file.txt 2>&1
```
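For readers adapting the scripts, the five positional arguments could be read and validated along the following lines. This is a minimal sketch, not the code used in the repository: `parse_sim_args` is a hypothetical helper, and the validation rules are taken only from the parameter table above.

```r
# Parse the five positional arguments described in the parameter table.
# `parse_sim_args` is a hypothetical helper, not a function from the repository.
parse_sim_args <- function(args) {
  if (length(args) != 5) {
    stop("Expected 5 arguments: state dist party task seed")
  }
  parsed <- list(
    state = args[1],
    dist  = args[2],
    party = args[3],
    task  = as.integer(args[4]),
    seed  = as.integer(args[5])
  )
  # Check each value against the ranges listed in the table.
  stopifnot(
    grepl("^[a-z]{2}$", parsed$state),
    parsed$dist %in% c("cd", "sd", "ld"),
    parsed$party %in% c("d", "r", "dr", "rd"),
    !is.na(parsed$task),
    !is.na(parsed$seed)
  )
  parsed
}

# Inside a script run with Rscript, the arguments would come from:
#   args <- commandArgs(trailingOnly = TRUE)
# Here the values from the example command are passed directly.
sim <- parse_sim_args(c("ga", "cd", "rd", "1", "1234"))
str(sim)
```

Validating the arguments up front makes a mistyped state or party fail immediately rather than hours into a run.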

The file `data/source/simulations_to_run.csv` specifies the script file, state, district, party, task, and seed for each simulation.

We ran a total of 5,100 simulations. Simulations for the smallest states run quickly, but each simulation for a larger state can take many hours. Running all of the simulations takes thousands of processor-hours (approximately 2,500 CPU-hours on our cluster). To replicate our simulation results, we recommend randomly selecting a set of simulations from `data/source/simulations_to_run.csv` and verifying that the output matches our files.
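One way to draw such a random spot-check is sketched below. The data frame is a small stand-in for `data/source/simulations_to_run.csv`: its column names and all rows (including the seeds) are hypothetical placeholders, so inspect the header of the real file and adjust before use.

```r
# Stand-in for data/source/simulations_to_run.csv. Column names and values
# are assumptions for illustration only. In practice, something like:
#   sims <- read.csv("data/source/simulations_to_run.csv")
sims <- data.frame(
  script = "code/dcp_sims_define_combine.R",
  state  = c("ga", "ia", "ga", "ia"),
  dist   = "cd",
  party  = c("rd", "rd", "dr", "dr"),
  task   = 1L,
  seed   = c(1234L, 2345L, 3456L, 4567L),  # placeholder seeds
  stringsAsFactors = FALSE
)

# Draw a reproducible random subset of rows to re-run.
set.seed(2024)  # arbitrary seed for the spot-check sample itself
n_check <- 2
picked <- sims[sample(nrow(sims), n_check), ]

# Build one Rscript command per selected simulation.
commands <- with(
  picked,
  paste("Rscript", script, state, dist, party, task, seed)
)
writeLines(commands)

# To actually launch them (each may run for hours), something like:
#   for (cmd in commands) system(cmd)
```

Fixing the sampling seed keeps the spot-check itself reproducible, so the same subset can be re-run later if an output does not match.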

We have included two scripts for running the simulations:

-   `code/dcp_sims_run_all.R` runs a subset of the simulations on a single computer (it can run all of them, but this is not recommended). The code uses the `parallel` and `foreach` packages to set up a local cluster that spreads the work across multiple processors. On lines 10 and 11, the user can select a single state and task number; the script then subsets the simulations to those parameters. This will run twelve simulations, covering all the different simulation types and parties.
-   `code/dcp_sims_run_cluster.R` submits each simulation as a separate job on a cluster that uses the `qsub` command to submit jobs (using the file `code/dcp_sims_run_cluster.qsub`). This script is configured to run on the Boston University Shared Computing Cluster.

### 4. Process simulation results

-   Run `code/dcp_sims_results.R` and `code/dcp_sims_results_stateleg.R`.

## Run Grid Simulations and Additional Results

-   VRA Simulations:
    -   `data/source/vra_simulations_to_run.csv` This file specifies the parameters for the VRA simulations.
    -   Run `code/additional_results/run_dcp_sims_vra.R` (or run `code/additional_results/dcp_sims_vra` to run a single simulation from the command line).
    -   Run `code/additional_results/dcp_sims_results_vra` to process the results.
-   Grid Simulations:
    -   `code/additional_results/grid_analysis_rect.R`
    -   `code/additional_results/grid_analysis_rect_clustering.R`
    -   `code/additional_results/grid_analysis_hex.R`
-   Iowa Simulations:
    -   `code/additional_results/ia_grid_example.R`
-   `code/grid_simulation_functions.R` is a supporting file with functions for generating grid maps.
-   `code/analytic_example.R` This file creates figures showing the derived results from the analytic example with no geography in the Appendix. These figures are not based on simulation results; they are meant to be illustrative for the reader.

## Generate All Figures and Tables

-   Run `code/figures_and_tables.R`
